Comparison Analysis of Dimensionality Reduction Techniques for Torus Data v24.08.01
Author: Sindhu, Sasidhar
Created: August 2, 2024 Modified: August 7, 2024

Objective

In this notebook, our objective is to explore Torus data, a synthetic dataset of points sampled from the surface of a 3D torus. We will employ dimensionality reduction methods such as Sammon's mapping, t-SNE, and UMAP to transform these points into lower-dimensional spaces (1D, 2D, 3D) while preserving their intrinsic structure. Our aim is to compare these techniques using various metrics and to analyze how hierarchical clustering distributes points across the resulting clusters.

Methods of Dimensionality Reduction

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for visualizing high-dimensional data by reducing it to 1, 2, or 3 dimensions while preserving local relationships between points. This transformation makes it easier to visualize and interpret complex data patterns.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.

  • perplexity - Perplexity is a crucial hyperparameter in t-SNE that controls how the algorithm balances local and global aspects of the data. It essentially determines the number of nearby data points each point should consider when forming its neighborhood. A value of 30, as used in the code, is a common choice that provides a good balance. It allows t-SNE to capture meaningful local clusters while still considering broader data relationships.

  • learning_rate - In t-SNE, the learning rate controls the step size of the optimization as it minimizes the Kullback–Leibler divergence between the high-dimensional and low-dimensional representations of the data. It determines how much the embedding is adjusted during each iteration. Typical values fall between 10 and 1000. Setting learning_rate to 200 gives effective, stable parameter updates during optimization for this dataset.

  • n_iter - The number of iterations determines how long the t-SNE algorithm runs to minimize the Kullback–Leibler divergence between the high-dimensional and low-dimensional representations of the data. This process is crucial for achieving a stable and meaningful embedding. We set n_iter to 1000, which provides enough iterations to reach a stable, well-optimized embedding for this dataset size.

  • random_state - The random_state parameter sets the seed for the random number generator used during the algorithm's execution. This seed determines the starting point for generating random numbers, which influences the initial conditions of the algorithm.
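The effect of perplexity on the optimization can be observed directly, since scikit-learn exposes the final Kullback–Leibler divergence as the fitted `kl_divergence_` attribute. Below is a minimal sketch on random toy data (not the torus dataset); the sizes and perplexity values are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: 200 points in 10 dimensions (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))

# Fit t-SNE at two perplexity values and compare the final KL divergence.
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                learning_rate=200, random_state=42)
    tsne.fit_transform(X)
    print(f"perplexity={perplexity}: KL divergence = {tsne.kl_divergence_:.3f}")
```

A lower final divergence does not automatically mean a better visualization; the two perplexities are optimizing different neighborhood scales, so the values are mainly useful for comparing runs at the same perplexity.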

UMAP (Uniform Manifold Approximation and Projection)

UMAP reduces dimensionality while maintaining both local and global data structures, making it suitable for understanding fine-grained and broad relationships. It is faster and more scalable than t-SNE, ideal for large datasets.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.

  • n_neighbors - This parameter determines the number of nearest neighbors each point considers when constructing the local neighborhood. It controls how UMAP balances between preserving local and global structure. In the code, n_neighbors is set to 60. This value helps to capture both local clusters and broader patterns in the data, providing a good balance between detail and global structure.

  • min_dist - Controls the minimum distance between embedded points in the lower-dimensional space. Lower values allow points to be closer together, preserving more local structure and detail. Higher values result in a more spread-out embedding, emphasizing the global structure and reducing local cluster tightness. Setting min_dist to 0.1 maintains some local structure while avoiding excessive crowding of points. This value helps ensure that points are reasonably close to each other in the embedding space without losing broader patterns.

  • random_state - Sets the seed for the random number generator to ensure reproducibility. Fixing it to 42 makes the UMAP results consistent across different runs, which is essential for reliable analysis and comparison.

Sammon's Mapping

Sammon's Mapping is a technique used to reduce the dimensionality of high-dimensional data while preserving inter-point distances as well as possible. It tries to keep the relative distances between points in the lower-dimensional space as close as possible to those in the original high-dimensional space, weighting errors on small distances more heavily. Since scikit-learn does not ship a Sammon's mapping implementation, the code cells below approximate it with metric MDS, which minimizes a closely related (unweighted) distance-preservation objective.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.

  • max_iter - Specifies the maximum number of iterations for optimizing the embedding, i.e., how many times the algorithm refines the positions of points to minimize the distance differences between the original and reduced spaces. We set max_iter to 1000 so the algorithm has enough iterations to converge to a stable and meaningful solution.
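Sammon's stress itself is straightforward to compute from pairwise distances. The sketch below defines `sammon_stress` (our own helper, not part of any library) so that any embedding, including the MDS approximations used later, can be scored against the Sammon objective:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, X_low, eps=1e-12):
    """Sammon's stress: weighted mismatch between original (D) and
    embedded (d) pairwise distances, emphasizing small distances."""
    D = pdist(X_high)
    d = pdist(X_low)
    mask = D > eps            # ignore coincident points
    D, d = D[mask], d[mask]
    return np.sum((D - d) ** 2 / D) / np.sum(D)

# Toy check: a perfect embedding has zero stress.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
print(sammon_stress(X, X))        # 0.0
print(sammon_stress(X, X[:, :2]))  # > 0: dropping a coordinate distorts distances
```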

Evaluation Metrics

To compare and evaluate the effectiveness of dimensionality reduction techniques, we use several metrics.

1. Structure Preservation

  • Trustworthiness: Evaluates how well the local structure (k-nearest neighbors) is preserved after dimensionality reduction. Higher scores indicate better retention of local relationships.

  • Continuity: Measures how well the overall local structure of the data is preserved in the reduced space. It assesses the consistency of k-nearest neighbors between the original and reduced spaces.
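Trustworthiness is available directly in scikit-learn; continuity has no built-in function, but a common approximation is to compute trustworthiness with the roles of the two spaces swapped. A minimal sketch on toy blob data (PCA stands in for the reduction step here):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)
X_low = PCA(n_components=2, random_state=42).fit_transform(X)

# Trustworthiness: are low-dimensional neighbors true neighbors in the original space?
t = trustworthiness(X, X_low, n_neighbors=5)

# Continuity (approximated by swapping the spaces): are original neighbors
# still neighbors after the reduction?
c = trustworthiness(X_low, X, n_neighbors=5)
print(f"Trustworthiness: {t:.3f}, Continuity: {c:.3f}")
```

Both scores lie in [0, 1], with 1 meaning perfect neighborhood preservation.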

2. Distance and Metric Quality

  • Root Mean Squared Error (RMSE) of Distances: Calculates the average deviation of pairwise distances between points in the original and reduced spaces. Lower RMSE values indicate better preservation of distances between points.
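The distance RMSE can be computed in a few lines with `pdist`; the helper below is our own sketch, and in practice the two distance sets may need rescaling when the reduced space lives on a different scale than the original:

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_rmse(X_high, X_low):
    """RMSE between all pairwise distances before and after reduction."""
    return np.sqrt(np.mean((pdist(X_high) - pdist(X_low)) ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
print(distance_rmse(X, X))          # 0.0 for a perfect (identity) embedding
print(distance_rmse(X, X[:, :2]))   # grows as distances are distorted
```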

3. Visualization Quality

  • Silhouette Score: Measures how well-separated clusters are by evaluating each point’s similarity to its own cluster versus other clusters. A higher score indicates better-defined clusters.

  • K-Nearest Neighbor (KNN) Retention: Measures how well the local distances between points are preserved in the reduced space by evaluating the retention of k-nearest neighbors. A higher score indicates better preservation of the local structure.
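Silhouette Score comes from scikit-learn; KNN retention we compute ourselves as the average overlap between each point's k nearest neighbors before and after reduction. The `knn_retention` helper below is a sketch of that idea (PCA again stands in for the reduction step):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def knn_retention(X_high, X_low, k=10):
    """Average fraction of each point's k nearest neighbors preserved by the reduction."""
    idx_high = NearestNeighbors(n_neighbors=k + 1).fit(X_high) \
        .kneighbors(X_high, return_distance=False)[:, 1:]  # drop self
    idx_low = NearestNeighbors(n_neighbors=k + 1).fit(X_low) \
        .kneighbors(X_low, return_distance=False)[:, 1:]
    overlap = [len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]
    return float(np.mean(overlap))

X, labels = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)
X_low = PCA(n_components=2, random_state=42).fit_transform(X)
print(f"Silhouette:    {silhouette_score(X_low, labels):.3f}")
print(f"KNN retention: {knn_retention(X, X_low):.3f}")
```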

Aggregate Score

Aggregating the evaluation metrics combines multiple performance measures into a single score to facilitate comparison. We sum Trustworthiness (T), Continuity (C), Silhouette Score (S), and KNN Retention (R), subtract the Root Mean Squared Error (RMSE) of distances to penalize distance distortion, and divide the result by 5 to obtain an equal-weighted average. This method highlights techniques that excel at preserving both local and global structure while minimizing distance distortion.

Aggregate Score = (T + C + S + R - RMSE) / 5
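The formula can be wrapped in a small helper; the metric values in the example calls below are made up purely for illustration:

```python
def aggregate_score(T, C, S, R, rmse):
    """Equal-weighted aggregate: reward structure preservation, penalize distance error."""
    return (T + C + S + R - rmse) / 5

# Illustrative (made-up) metric values for two hypothetical techniques:
print(aggregate_score(T=0.95, C=0.93, S=0.40, R=0.80, rmse=0.50))  # ~0.516
print(aggregate_score(T=0.90, C=0.88, S=0.35, R=0.70, rmse=0.20))  # ~0.526
```

Note that T, C, S, and R are bounded in roughly [0, 1] while the RMSE is not, so a technique with large distance errors can dominate the score; comparing RMSE separately is a useful sanity check.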

Torus Embeddings

We will apply dimensionality reduction techniques to the Torus data, followed by hierarchical clustering to identify clusters. The dimensionality reduction will transform the high-dimensional data into a lower-dimensional space, preserving essential structures and relationships. After this transformation, hierarchical clustering will group data points based on their proximity, forming a dendrogram. The fcluster function will then cut this dendrogram to produce at most eight flat clusters (t=8 with criterion='maxclust'). We will then evaluate the effectiveness of the dimensionality reduction and clustering using the metrics above.
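To make the clustering step concrete before the full pipeline, here is a minimal, self-contained sketch of the dendrogram cut on random toy data (not the torus embeddings); with criterion='maxclust', t bounds the number of flat clusters rather than specifying a cut height:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 2))

# Ward linkage on the condensed pairwise-distance matrix.
Z = linkage(pdist(X, metric='euclidean'), method='ward')

# criterion='maxclust' cuts the dendrogram so that at most t flat clusters remain.
labels = fcluster(Z, t=8, criterion='maxclust')
print(len(np.unique(labels)))  # at most 8
```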

Data Loading and Preprocessing

In [23]:
import numpy as np
import pandas as pd

# Function to generate a 3D torus dataset
def generate_torus(n, R, r):
    theta = np.random.uniform(0, 2 * np.pi, n)
    phi = np.random.uniform(0, 2 * np.pi, n)
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    return np.column_stack((x, y, z))

# Parameters
n = 1000  # Number of points
R = 5      # Major radius
r = 2      # Minor radius

# Generate the torus dataset
torus_data = generate_torus(n, R, r)

# Create DataFrame and add serial number column
data = pd.DataFrame(torus_data, columns=['x1', 'y1', 'z1'])
data['sno'] = np.arange(1, n + 1)

name = data['sno']
x1 = data['x1']
y1 = data['y1']
z1 = data['z1']

# Display the DataFrame
print(data.head())

embeddings = data.drop(['sno'], axis=1).values
         x1        y1        z1  sno
0  1.394774  2.824274  0.759710    1
1 -4.803936  5.085617  0.129463    2
2  2.043712 -6.001685 -1.484624    3
3  2.707566  1.336426 -0.278094    4
4  6.725266  1.344903  0.739097    5
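As a sanity check on the generator above, every sampled point should satisfy the torus implicit equation (sqrt(x² + y²) − R)² + z² = r². The sketch below re-creates the generator with a hypothetical `seed` argument (not in the original cell) so the check is reproducible:

```python
import numpy as np

# Re-creation of the generator from the cell above, with a seed added.
def generate_torus(n, R, r, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n)
    phi = rng.uniform(0, 2 * np.pi, n)
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    return np.column_stack((x, y, z))

pts = generate_torus(1000, R=5, r=2)
# (sqrt(x^2 + y^2) - R)^2 + z^2 should equal r^2 = 4 for every point.
lhs = (np.sqrt(pts[:, 0]**2 + pts[:, 1]**2) - 5)**2 + pts[:, 2]**2
print(np.allclose(lhs, 2**2))  # True
```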

t-SNE for Torus Embeddings

t-SNE - 1D

In [7]:
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings
import time
warnings.filterwarnings("ignore")
 
# Start the timer
start_time = time.time()
 
# Apply t-SNE with 1D
tsne_embeddings_1D = TSNE(n_components=1, perplexity=30, learning_rate=200, n_iter=1000, random_state=42).fit_transform(embeddings)
 
# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_1D[:, 0],
    'x2': x1,
    'y2': y1,
    'z2': z1,
    'Name': name
}
tsne_result_df = pd.DataFrame(tsne_result_data)
 
# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1']].values
distance_matrix = pdist(X, metric='euclidean')
 
Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_1D = fcluster(Z, t=8, criterion='maxclust')
 
 
# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_1D
 
# Highlight specific nodes
highlight_names = [237, 650, 489, 469]  # Use appropriate names or IDs within range
tsne_result_df['color'] = tsne_result_df['Name'].apply(lambda x: 'red' if x in highlight_names else 'blue')
 
# Save results to CSV
tsne_output_csv_path = 'tsne_industry_1D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)
 
# Create interactive plot for 1D t-SNE embeddings
tsne_fig_1D = px.scatter(tsne_result_df, x='t-SNE Dimension 1', y=[0] * len(tsne_result_df),
                      color='color',
                      title='t-SNE Projection (1D)',
                      color_discrete_map={'red': 'red', 'blue': 'blue'},
                      hover_data={'TSNE_Cluster': True, 'Name': True})
 
# Update the plot to remove the legend
tsne_fig_1D.update_traces(marker=dict(size=5), showlegend=False)
 
# Add text annotations for the highlighted nodes
annotations = []
# Use a distinct loop variable so the global `name` Series (used by later cells)
# is not overwritten.
for node_name in highlight_names:
    node = tsne_result_df[tsne_result_df['Name'] == node_name]
    annotations.append(go.layout.Annotation(
        x=node['t-SNE Dimension 1'].values[0],
        y=0,
        text=str(node_name),
        showarrow=True,
        arrowhead=5,
        arrowsize=2,  
        arrowcolor='red',
        ax=40,
        ay=-30
    ))
 
# Add annotations to the figure
tsne_fig_1D.update_layout(annotations=annotations)

# tsne_fig_1D.show()
 
# End the timer
end_time = time.time()
 
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.4f} seconds")
Time taken: 0.6963 seconds

t-SNE - 2D

In [24]:
import warnings
warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply t-SNE with 2D
tsne_embeddings_2D = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42).fit_transform(embeddings)

# Measure time after t-SNE
tsne_time = time.time()
print(f"t-SNE 2D: {tsne_time - start_time:.2f} seconds")

# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_2D[:, 0],
    't-SNE Dimension 2': tsne_embeddings_2D[:, 1],
    'Name': name,
}
tsne_result_df = pd.DataFrame(tsne_result_data)

# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1', 't-SNE Dimension 2']].values
distance_matrix = pdist(X, metric='euclidean')

Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_2D = fcluster(Z, t=8, criterion='maxclust')

# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_2D

# Save results to CSV
tsne_output_csv_path = 'tsne_industry_2D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
tsne_result_df['Legend'] = 'Other Nodes'
tsne_result_df.loc[tsne_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 2D t-SNE embeddings
tsne_fig_2D = px.scatter(
    tsne_result_df, 
    x='t-SNE Dimension 1', 
    y='t-SNE Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='t-SNE Projection (2D)',
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = tsne_result_df[tsne_result_df['Name'] == node]
    if not node_data.empty:
        tsne_fig_2D.add_annotation(
            x=node_data['t-SNE Dimension 1'].values[0],
            y=node_data['t-SNE Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
tsne_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2'
)

# Save the DataFrame to a CSV file
tsne_output_csv_path = 'tsne_industry_2D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# tsne_fig_2D.show()
t-SNE 2D: 2.04 seconds

t-SNE - 3D

In [13]:
import time
import pandas as pd
import plotly.graph_objects as go
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import warnings

warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply t-SNE with 3D
tsne_embeddings_3D = TSNE(n_components=3, perplexity=30, learning_rate=200, n_iter=1000, random_state=42).fit_transform(embeddings)

# Measure time after t-SNE
tsne_time = time.time()
print(f"t-SNE 3D: {tsne_time - start_time:.2f} seconds")

# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_3D[:, 0],
    't-SNE Dimension 2': tsne_embeddings_3D[:, 1],
    't-SNE Dimension 3': tsne_embeddings_3D[:, 2],
    'Name': name
}
tsne_result_df = pd.DataFrame(tsne_result_data)

# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1', 't-SNE Dimension 2', 't-SNE Dimension 3']].values
distance_matrix = pdist(X, metric='euclidean')

Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_3D = fcluster(Z, t=8, criterion='maxclust')

# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_3D

# Save results to CSV
tsne_output_csv_path = 'tsne_industry_3D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
tsne_result_df['Legend'] = 'Other Nodes'
tsne_result_df.loc[tsne_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create the interactive 3D scatter plot using go.Scatter3d
tsne_fig_3D = go.Figure()

# Add trace for other nodes
tsne_fig_3D.add_trace(
    go.Scatter3d(
        x=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 1'],
        y=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 2'],
        z=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 3'],
        mode='markers',
        marker=dict(size=2,color='blue'),  # Adjusted size for other nodes
        name='Other Nodes'
    )
)

# Add trace for highlighted nodes
tsne_fig_3D.add_trace(
    go.Scatter3d(
        x=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 1'],
        y=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 2'],
        z=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 3'],
        mode='markers+text',
        marker=dict(size=6,color='red'),  
        text=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['Name'],
        textposition='top center',
        name='Highlighted Nodes'
    )
)

# Update layout for better readability
tsne_fig_3D.update_layout(
    showlegend=False,  
    scene=dict(
        xaxis_title='t-SNE Dimension 1',
        yaxis_title='t-SNE Dimension 2',
        zaxis_title='t-SNE Dimension 3'
    ),
    title='t-SNE Projection (3D)',

)
# tsne_fig_3D.show()

UMAP for Torus Embeddings

UMAP - 1D

In [10]:
import time
import pandas as pd
import umap.umap_ as umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import plotly.graph_objects as go

# Start time measurement
start_time = time.time()

# Apply UMAP for 1D
umap_embeddings_1D = umap.UMAP(n_components=1, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 1D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_1D[:, 0],
    'Name': name,
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_1D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_1D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_1D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 1D UMAP embeddings
umap_fig_1D = px.scatter(
    umap_result_df, 
    x='UMAP Dimension 1', 
    y=[0] * len(umap_result_df),  
    hover_name='Name', 
    color='Legend',  
    title='UMAP Projection (1D)',
    hover_data={'UMAP_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = umap_result_df[umap_result_df['Name'] == node]
    if not node_data.empty:
        umap_fig_1D.add_annotation(
            x=node_data['UMAP Dimension 1'].values[0],
            y=0,  # y is 0 for 1D
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
umap_fig_1D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='UMAP Dimension 1',
    yaxis_title='y'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_1D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 1D: 4.88 seconds

UMAP - 2D

In [11]:
import time
import pandas as pd
import umap.umap_ as umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import plotly.graph_objects as go

# Start time measurement
start_time = time.time()

# Apply UMAP for 2D
umap_embeddings_2D = umap.UMAP(n_components=2, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 2D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_2D[:, 0],
    'UMAP Dimension 2': umap_embeddings_2D[:, 1],
    'Name': name,
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1', 'UMAP Dimension 2']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_2D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_2D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_2D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 2D UMAP embeddings
umap_fig_2D = px.scatter(
    umap_result_df, 
    x='UMAP Dimension 1', 
    y='UMAP Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='UMAP Projection (2D)',
    hover_data={'UMAP_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = umap_result_df[umap_result_df['Name'] == node]
    if not node_data.empty:
        umap_fig_2D.add_annotation(
            x=node_data['UMAP Dimension 1'].values[0],
            y=node_data['UMAP Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
umap_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='UMAP Dimension 1',
    yaxis_title='UMAP Dimension 2'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_2D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 2D: 2.36 seconds

UMAP - 3D

In [14]:
import time
import pandas as pd
import plotly.graph_objects as go
import umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import warnings

warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply UMAP for 3D
umap_embeddings_3D = umap.UMAP(n_components=3, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 3D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_3D[:, 0],
    'UMAP Dimension 2': umap_embeddings_3D[:, 1],
    'UMAP Dimension 3': umap_embeddings_3D[:, 2],
    'Name': name,
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1', 'UMAP Dimension 2', 'UMAP Dimension 3']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_3D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_3D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_3D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create the interactive 3D scatter plot using go.Scatter3d
umap_fig_3D = go.Figure()

# Add trace for other nodes
umap_fig_3D.add_trace(
    go.Scatter3d(
        x=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 1'],
        y=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 2'],
        z=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 3'],
        mode='markers',
        marker=dict(size=2, color='blue'),  # Adjusted size for other nodes
        name='Other Nodes'
    )
)

# Add trace for highlighted nodes
umap_fig_3D.add_trace(
    go.Scatter3d(
        x=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 1'],
        y=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 2'],
        z=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 3'],
        mode='markers+text',
        marker=dict(size=6, color='red'),  # Larger size for highlighted nodes
        text=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['Name'],
        textposition='top center',
        name='Highlighted Nodes'
    )
)

# Update layout for better readability
umap_fig_3D.update_layout(
    title='UMAP Projection (3D)',
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='UMAP Dimension 3'
    ),
    showlegend=False,  # Hide the legend
    legend_title_text='Node Type'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_3D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 3D: 2.13 seconds

Sammon's Mapping for Torus Embeddings

Sammon's Mapping - 1D

In [16]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import time

# Start time measurement
start_time = time.time()

# Apply MDS for 1D (random_state fixed for reproducibility, as with t-SNE and UMAP)
mds = MDS(n_components=1, max_iter=1000, random_state=42)
mds_embeddings_1D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"Sammon's 1D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data_1D = {
    'Sammon Dimension 1': mds_embeddings_1D[:, 0],
    'Name': name,

}
mds_result_df_1D = pd.DataFrame(mds_result_data_1D)

# Perform hierarchical clustering
X_mds_1D = mds_result_df_1D[['Sammon Dimension 1']].values
distance_matrix_mds_1D = pdist(X_mds_1D, metric='euclidean')

Z_mds_1D = linkage(distance_matrix_mds_1D, method='ward')
cluster_labels_mds_1D = fcluster(Z_mds_1D, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df_1D['MDS_Cluster'] = cluster_labels_mds_1D

# Save MDS result to CSV
mds_output_csv_path_1D = 'mds_industry_1D.csv'
mds_result_df_1D.to_csv(mds_output_csv_path_1D, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
mds_result_df_1D['Legend'] = 'Other Nodes'
mds_result_df_1D.loc[mds_result_df_1D['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for MDS embeddings
mds_fig_1D = px.scatter(
    mds_result_df_1D, 
    x='Sammon Dimension 1', 
    y=[0] * len(mds_result_df_1D),  # Since it's 1D, y will be constant
    hover_name='Name', 
    color='Legend',  
    title='Sammon Projection (1D)',
    hover_data={'MDS_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df_1D[mds_result_df_1D['Name'] == node]
    if not node_data.empty:
        mds_fig_1D.add_annotation(
            x=node_data['Sammon Dimension 1'].values[0],
            y=0,
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
mds_fig_1D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='Sammon Dimension 1',
    yaxis_title='y'
)

Sammon's Mapping - 2D

In [17]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px

# Start time measurement
start_time = time.time()

# Apply MDS for 2D (random_state fixed for reproducibility, as with t-SNE and UMAP)
mds = MDS(n_components=2, max_iter=1000, random_state=42)
mds_embeddings_2D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"Sammons 2D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data = {
    'Sammons Dimension 1': mds_embeddings_2D[:, 0],
    'Sammons Dimension 2': mds_embeddings_2D[:, 1],
    'Name': name,
}
mds_result_df = pd.DataFrame(mds_result_data)

# Perform hierarchical clustering
X_mds = mds_result_df[['Sammons Dimension 1', 'Sammons Dimension 2']].values
distance_matrix_mds = pdist(X_mds, metric='euclidean')

Z_mds = linkage(distance_matrix_mds, method='ward')
cluster_labels_mds_2D = fcluster(Z_mds, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df['MDS_Cluster'] = cluster_labels_mds_2D

# Save MDS result to CSV
mds_output_csv_path = 'mds_industry_2D.csv'
mds_result_df.to_csv(mds_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
mds_result_df['Legend'] = 'Other Nodes'
mds_result_df.loc[mds_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for MDS embeddings
mds_fig_2D = px.scatter(
    mds_result_df, 
    x='Sammons Dimension 1', 
    y='Sammons Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='Sammons Projection (2D)',
    hover_data={'MDS_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df[mds_result_df['Name'] == node]
    if not node_data.empty:
        mds_fig_2D.add_annotation(
            x=node_data['Sammons Dimension 1'].values[0],
            y=node_data['Sammons Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
mds_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='Sammons Dimension 1',
    yaxis_title='Sammons Dimension 2'
)

Sammons 2D: 4.54 seconds

Sammon's Mapping - 3D

In [22]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.graph_objects as go
import time

# Start time measurement
start_time = time.time()

# Apply MDS for 3D
mds = MDS(n_components=3, max_iter=1000)
mds_embeddings_3D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"MDS 3D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data_3D = {
    'MDS Dimension 1': mds_embeddings_3D[:, 0],
    'MDS Dimension 2': mds_embeddings_3D[:, 1],
    'MDS Dimension 3': mds_embeddings_3D[:, 2],
    'Name': name,
}
mds_result_df_3D = pd.DataFrame(mds_result_data_3D)

# Perform hierarchical clustering
X_mds_3D = mds_result_df_3D[['MDS Dimension 1', 'MDS Dimension 2', 'MDS Dimension 3']].values
distance_matrix_mds_3D = pdist(X_mds_3D, metric='euclidean')

Z_mds_3D = linkage(distance_matrix_mds_3D, method='ward')
cluster_labels_mds_3D = fcluster(Z_mds_3D, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df_3D['MDS_Cluster'] = cluster_labels_mds_3D

# Save MDS result to CSV
mds_output_csv_path_3D = 'mds_industry_3D.csv'
mds_result_df_3D.to_csv(mds_output_csv_path_3D, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = [237, 650, 489, 469]

# Create a new column for the legend
mds_result_df_3D['Legend'] = 'Other Nodes'
mds_result_df_3D.loc[mds_result_df_3D['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create 3D scatter plot for MDS embeddings
mds_fig_3D = go.Figure()

# Add scatter plot for all nodes
mds_fig_3D.add_trace(go.Scatter3d(
    x=mds_result_df_3D['MDS Dimension 1'],
    y=mds_result_df_3D['MDS Dimension 2'],
    z=mds_result_df_3D['MDS Dimension 3'],
    mode='markers',
    marker=dict(
        size=2,
        color=mds_result_df_3D['Legend'].map({'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}),
        opacity=0.8
    ),
    text=mds_result_df_3D['Name'],
    hoverinfo='text'
))

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df_3D[mds_result_df_3D['Name'] == node]
    if not node_data.empty:
        mds_fig_3D.add_trace(go.Scatter3d(
            x=[node_data['MDS Dimension 1'].values[0]],
            y=[node_data['MDS Dimension 2'].values[0]],
            z=[node_data['MDS Dimension 3'].values[0]],
            mode='markers+text',
            marker=dict(size=6, color='red'),  # Larger size for highlighted nodes
            text=[node],
            textposition='top center'
        ))

# Update layout for better readability
mds_fig_3D.update_layout(
    title='Sammons Projection (3D)',
    scene=dict(
        xaxis_title='Sammons Dimension 1',
        yaxis_title='Sammons Dimension 2',
        zaxis_title='Sammons Dimension 3'
    ),
    showlegend=False  # Hide the legend
)

Evaluation Metrics

In [42]:
# Define the embeddings and methods
methods = ['tsne', 'umap', 'sammon']
dims = ['1D', '2D', '3D']
original_embeddings = embeddings  # keep the high-dimensional array before rebinding the name
embeddings = {
    'original': original_embeddings,
    'tsne': [tsne_embeddings_1D, tsne_embeddings_2D, tsne_embeddings_3D],
    'umap': [umap_embeddings_1D, umap_embeddings_2D, umap_embeddings_3D],
    'sammon': [mds_embeddings_1D, mds_embeddings_2D, mds_embeddings_3D]
}

Trustworthiness

In [14]:
from sklearn.manifold import trustworthiness
import time

# Calculate and print trustworthiness scores
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        trust_score = trustworthiness(embeddings['original'], embeddings[method][idx], n_neighbors=5)
        print(f"{method.capitalize()} Trustworthiness ({dim}): {trust_score: .4f}")
Tsne Trustworthiness (1D):  0.9963
Tsne Trustworthiness (2D):  0.9995
Tsne Trustworthiness (3D):  0.9997
Umap Trustworthiness (1D):  0.9864
Umap Trustworthiness (2D):  0.9958
Umap Trustworthiness (3D):  0.9991
Sammon Trustworthiness (1D):  0.8332
Sammon Trustworthiness (2D):  0.9524
Sammon Trustworthiness (3D):  1.0000
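
As a quick sanity check on how trustworthiness behaves at its extremes: an embedding that preserves every neighborhood scores exactly 1.0, and an embedding whose neighborhoods are destroyed scores far lower. A small sketch on synthetic data (illustrative only, not the Torus embeddings):

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))  # stand-in for high-dimensional embeddings

# Identical "embedding": every neighborhood is preserved exactly.
print(f"identity: {trustworthiness(X, X, n_neighbors=5):.4f}")  # 1.0000

# Row-shuffled embedding: neighborhoods become essentially random.
shuffled = X[rng.permutation(len(X))]
print(f"shuffled: {trustworthiness(X, shuffled, n_neighbors=5):.4f}")
```

This helps put the scores above in context: even Sammon's 1D projection (0.8332) is well above the random-neighborhood baseline.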

Continuity

In [43]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

def continuity(original_data, reduced_data, k=5):
    """
    Simplified continuity: the average fraction of each point's k nearest
    neighbors in the original space that remain among its k nearest neighbors
    after reduction (a symmetric neighbor-overlap variant of the classical
    rank-weighted continuity metric).
    """
    # Find K nearest neighbors in the original high-dimensional space
    nbrs_original = NearestNeighbors(n_neighbors=k+1, algorithm='auto').fit(original_data)
    _, indices_original = nbrs_original.kneighbors(original_data)
    
    # Find K nearest neighbors in the reduced-dimensional space
    nbrs_reduced = NearestNeighbors(n_neighbors=k+1, algorithm='auto').fit(reduced_data)
    _, indices_reduced = nbrs_reduced.kneighbors(reduced_data)
    
    # Exclude the point itself from the neighbor list
    indices_original = indices_original[:, 1:]
    indices_reduced = indices_reduced[:, 1:]
    
    # Compute continuity rate
    continuity_rates = [
        len(set(indices_original[i]).intersection(indices_reduced[i])) / k
        for i in range(len(original_data))
    ]
    
    return np.mean(continuity_rates)

# Calculate and print continuity metrics
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        continuity_score = continuity(embeddings['original'], embeddings[method][idx], k=5)
        print(f"{method.capitalize()} Continuity ({dim}): {continuity_score: .4f}")
Tsne Continuity (1D):  0.6130
Tsne Continuity (2D):  0.8370
Tsne Continuity (3D):  0.8644
Umap Continuity (1D):  0.4370
Umap Continuity (2D):  0.6958
Umap Continuity (3D):  0.7640
Sammon Continuity (1D):  0.1246
Sammon Continuity (2D):  0.5162
Sammon Continuity (3D):  0.9882

Silhouette Score

In [55]:
from sklearn.metrics import silhouette_score

# Define your methods and dimensions
methods = ['tsne', 'umap', 'sammon']
dims = ['1D', '2D', '3D']

cluster_labels = {
    'tsne': [cluster_labels_tsne_1D, cluster_labels_tsne_2D, cluster_labels_tsne_3D],
    'umap': [cluster_labels_umap_1D, cluster_labels_umap_2D, cluster_labels_umap_3D],
    'sammon': [cluster_labels_mds_1D, cluster_labels_mds_2D, cluster_labels_mds_3D]
}

# Calculate and print normalized silhouette scores
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        # Calculate silhouette score
        score = silhouette_score(embeddings[method][idx], cluster_labels[method][idx])
        # Normalize the silhouette score to the range [0, 1]
        normalized_score = (score + 1) / 2
        print(f"{method.capitalize()} Silhouette ({dim}): {normalized_score:.4f}")
Tsne Silhouette (1D): 0.7658
Tsne Silhouette (2D): 0.6865
Tsne Silhouette (3D): 0.6410
Umap Silhouette (1D): 0.7687
Umap Silhouette (2D): 0.6874
Umap Silhouette (3D): 0.6640
Sammon Silhouette (1D): 0.7559
Sammon Silhouette (2D): 0.6492
Sammon Silhouette (3D): 0.6039

RMSE of Distances

In [53]:
import numpy as np
from sklearn.metrics import mean_squared_error
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import MinMaxScaler

# Compute pairwise distances in original space
original_distances = squareform(pdist(embeddings['original'], metric='euclidean'))

# Initialize list to store RMSE values
rmse_values = []

# Calculate RMSE for each method and dimension
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        reduced_distances = squareform(pdist(embeddings[method][idx], metric='euclidean'))
        rmse = np.sqrt(mean_squared_error(original_distances, reduced_distances))
        rmse_values.append(rmse)

# Normalize RMSE values to the range (0, 1)
scaler = MinMaxScaler()
rmse_values_reshaped = np.array(rmse_values).reshape(-1, 1)  # Reshape for scaler
rmse_normalized = scaler.fit_transform(rmse_values_reshaped).flatten()

# Print original and normalized RMSE values
print("\nOriginal and Normalized RMSE values:")
for i, method in enumerate(methods):
    for j, dim in enumerate(dims):
        idx = i * len(dims) + j
        print(f"{method.capitalize()} RMSE ({dim}): {rmse_values[idx]:.4f}, Normalized: {rmse_normalized[idx]:.4f}")
Original and Normalized RMSE values:
Tsne RMSE (1D): 59.6924, Normalized: 1.0000
Tsne RMSE (2D): 35.3365, Normalized: 0.5919
Tsne RMSE (3D): 7.2308, Normalized: 0.1211
Umap RMSE (1D): 10.6488, Normalized: 0.1783
Umap RMSE (2D): 1.9564, Normalized: 0.0327
Umap RMSE (3D): 1.1909, Normalized: 0.0199
Sammon RMSE (1D): 1.4838, Normalized: 0.0248
Sammon RMSE (2D): 0.3028, Normalized: 0.0050
Sammon RMSE (3D): 0.0042, Normalized: 0.0000

K-Nearest Neighbor (KNN) Retention

In [24]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
 
# Function to calculate KNN retention
def knn_retention(high_dim_embeddings, low_dim_embeddings, k=5):
    # Note: each point is returned as its own nearest neighbor in both spaces,
    # so these scores carry a guaranteed baseline overlap of 1/k; the
    # continuity function above uses n_neighbors=k+1 to exclude the self-match.
    # Find k-nearest neighbors in high-dimensional space
    high_dim_nn = NearestNeighbors(n_neighbors=k).fit(high_dim_embeddings)
    high_dim_neighbors = high_dim_nn.kneighbors(high_dim_embeddings, return_distance=False)
    # Find k-nearest neighbors in low-dimensional space
    low_dim_nn = NearestNeighbors(n_neighbors=k).fit(low_dim_embeddings)
    low_dim_neighbors = low_dim_nn.kneighbors(low_dim_embeddings, return_distance=False)
    # Calculate overlap
    total_overlap = 0
    for i in range(high_dim_embeddings.shape[0]):
        overlap = np.intersect1d(high_dim_neighbors[i], low_dim_neighbors[i]).size
        total_overlap += overlap
    # Calculate average retention
    avg_retention = total_overlap / (high_dim_embeddings.shape[0] * k)
    return avg_retention
 
# `methods`, `dims`, and the `embeddings` dictionary (with the original
# high-dimensional array under 'original') are reused from the Evaluation
# Metrics cell above; rebuilding the dictionary here with
# 'original': embeddings would nest the previous dictionary inside itself.
 
# Compute and print KNN retention for each method and dimension
for method in methods:
    for dim in dims:
        reduced_embedding = embeddings[method][dims.index(dim)]
        knn_ret_value = knn_retention(embeddings['original'], reduced_embedding)
        print(f"KNN retention between original and {method} ({dim}): {knn_ret_value:.4f}")
KNN retention between original and tsne (1D): 0.6690
KNN retention between original and tsne (2D): 0.8670
KNN retention between original and tsne (3D): 0.8886
KNN retention between original and umap (1D): 0.5160
KNN retention between original and umap (2D): 0.7404
KNN retention between original and umap (3D): 0.7986
KNN retention between original and sammon (1D): 0.2948
KNN retention between original and sammon (2D): 0.6014
KNN retention between original and sammon (3D): 0.9910

Sammon's Stress

In [49]:
from scipy.spatial.distance import pdist
import numpy as np
 
# `methods`, `dims`, and the `embeddings` dictionary are reused from the
# Evaluation Metrics cell above.
 
# Calculate pairwise distances in the original space
original_distances = pdist(embeddings['original'], metric='euclidean')
 
# Define Sammon's Stress function
def sammon_stress(original_distances, projected_distances):
    # This variant sums squared *relative* errors and normalizes by the sum of
    # squared original distances. Sammon's original stress instead weights each
    # squared error by 1/d*_ij and normalizes by the sum of original distances.
    # Filter out zero distances to avoid division by zero
    mask = original_distances > 0
    original_distances = original_distances[mask]
    projected_distances = projected_distances[mask]
    # Compute the numerator and denominator
    numerator = np.sum(((original_distances - projected_distances) / original_distances) ** 2)
    denominator = np.sum(original_distances ** 2)
    stress = numerator / denominator
    return stress
 
# Calculate Sammon's Stress for each method and dimension
for method in methods:
    for dim in dims:
        # Get the projected embeddings
        projected_embeddings = embeddings[method][dims.index(dim)]
        # Calculate pairwise distances for projected embeddings
        projected_distances = pdist(projected_embeddings, metric='euclidean')
        # Calculate stress
        stress = sammon_stress(original_distances, projected_distances)
        print(f"Sammon's Stress ({method.upper()} {dim}): {stress:.4f}")
Sammon's Stress (TSNE 1D): 38.5119
Sammon's Stress (TSNE 2D): 13.6975
Sammon's Stress (TSNE 3D): 0.5801
Sammon's Stress (UMAP 1D): 1.4979
Sammon's Stress (UMAP 2D): 0.0493
Sammon's Stress (UMAP 3D): 0.0200
Sammon's Stress (SAMMON 1D): 0.0721
Sammon's Stress (SAMMON 2D): 0.0026
Sammon's Stress (SAMMON 3D): 0.0000
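
For comparison, Sammon's original stress weights each squared error by the inverse original distance and normalizes by the sum of the original distances. A sketch of that classical formula (the function name is illustrative; inputs are condensed distance vectors as returned by `pdist`):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress_classic(original_distances, projected_distances):
    """Sammon's original stress: (1 / sum d*) * sum (d* - d)^2 / d*."""
    mask = original_distances > 0          # skip coincident points
    d_star = original_distances[mask]
    d = projected_distances[mask]
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)

# Sanity check: a distance-preserving embedding has zero stress.
X = np.random.default_rng(0).normal(size=(30, 5))
d = pdist(X)
print(f"{sammon_stress_classic(d, d):.4f}")  # 0.0000
```

Either normalization ranks the three methods the same way here; the classical form simply penalizes errors on small original distances more heavily.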

Visualization of 1D Projection

In [19]:
tsne_fig_1D.show(renderer='notebook')
umap_fig_1D.show(renderer='notebook')
mds_fig_1D.show(renderer='notebook')

Visualization of 2D Projection

In [25]:
tsne_fig_2D.show(renderer='notebook')
umap_fig_2D.show(renderer='notebook')
mds_fig_2D.show(renderer='notebook')

Visualization of 3D Projection

In [21]:
# tsne_fig_3D.update_traces(marker=dict(size=2)) 
tsne_fig_3D.show(renderer='notebook')
# umap_fig_3D.update_traces(marker=dict(size=2)) 
umap_fig_3D.show(renderer='notebook')
# mds_fig_3D.update_traces(marker=dict(size=2)) 
mds_fig_3D.show(renderer='notebook')

Conclusion

| Metric            | t-SNE (1D) | UMAP (1D) | Sammon's (1D) | t-SNE (2D) | UMAP (2D) | Sammon's (2D) | t-SNE (3D) | UMAP (3D) | Sammon's (3D) |
|-------------------|------------|-----------|---------------|------------|-----------|---------------|------------|-----------|---------------|
| Trustworthiness   | 0.9963     | 0.9864    | 0.8332        | 0.9995     | 0.9958    | 0.9524        | 0.9997     | 0.9991    | 1.0000        |
| Continuity        | 0.6130     | 0.4370    | 0.1246        | 0.8370     | 0.6958    | 0.5162        | 0.8644     | 0.7640    | 0.9882        |
| Silhouette Score  | 0.7658     | 0.7687    | 0.7559        | 0.6865     | 0.6874    | 0.6492        | 0.6410     | 0.6640    | 0.6039        |
| RMSE (normalized) | 1.0000     | 0.1783    | 0.0248        | 0.5919     | 0.0327    | 0.0050        | 0.1211     | 0.0199    | 0.0000        |
| KNN Retention     | 0.6690     | 0.5160    | 0.2948        | 0.8670     | 0.7404    | 0.6014        | 0.8886     | 0.7986    | 0.9910        |
| Aggregate Score   | 0.4090     | 0.3059    | 0.3967        | 0.5596     | 0.4173    | 0.5428        | 0.4050     | 0.4412    | 0.7166        |

Table 1: Evaluation Scores
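
The notebook does not state how the Aggregate Score row is computed. Purely as an illustration, one simple construction is an unweighted mean in which normalized RMSE is inverted so that higher is better for every term; this is an assumption and does not necessarily reproduce the exact values in Table 1, which may use a different weighting.

```python
import numpy as np

# Hypothetical aggregate: unweighted mean of the five metrics, with
# normalized RMSE inverted (1 - rmse) so that higher is uniformly better.
def aggregate_score(trust, cont, silhouette, rmse_norm, knn):
    return float(np.mean([trust, cont, silhouette, 1.0 - rmse_norm, knn]))

# Sammon's Mapping (3D) metric values from Table 1:
score = aggregate_score(1.0000, 0.9882, 0.6039, 0.0000, 0.9910)
print(f"{score:.4f}")
```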

Sammon's Mapping (3D) achieves the highest aggregate score, indicating that it best preserves the data's structure across the evaluated metrics. It leads on Trustworthiness, Continuity, and K-Nearest Neighbor (KNN) Retention, and has the lowest RMSE, suggesting that it maintains both local and global structure while introducing minimal distance distortion.

UMAP (3D) also preserves the local and global structure of the data well. It performs particularly strongly on Trustworthiness and KNN Retention, and its relatively low RMSE indicates little distortion of pairwise distances.